Capture Cookbook 📈👩‍🍳

1 About the Capture Cookbook

The purpose of this document is to provide a resource for anyone at SHARE Creative or Capture Intelligence to understand the commonly used visualisations produced for client work; that is, to appreciate when they should be used and how to interpret them.

This is nicely summarised by the BBC in their graphics cookbook:

A reference manual, rather than a tutorial, it might not tell you how to make your very first chart in R, but is a useful collection of little tips and tricks.

Indeed, this document will not teach you R, or how to physically make these plots. Rather, its goal is to help you understand how and when visualisations should be used in our work, and to empower anyone with the confidence to explain the key meaning, strengths, and weaknesses of different plots.

2 Specific Visualisation Purposes

2.1 Trend Over Time

Perhaps the simplest plot we would make would be for the purpose of understanding how a metric or variable has changed over time.

At a high level, this could be when analysing volume, proportion, or percentage over time.

2.1.1 Example

Figure 2.1: Line chart showing daily trend

2.1.2 Interpretation

The interpretation of a line chart is very simple, but we must always be careful to look at the axes to understand what is being shown. Does the y axis show raw values or proportions? Does our x axis encompass a year or a month?

One thing to be wary of is that for trend plots, we invariably have to bin the data into more accessible intervals of dates (especially as Sprinklr outputs ‘Created Time’ to the nearest second, and it is unlikely many posts are posted at exactly the same second). Depending on which time intervals are chosen, the interpretation of the plots can vary. Choosing an inappropriate interval can obscure trends or make them appear more dramatic than they actually are. In the example above, the data was binned on a daily basis; however, we can see below how the plot changes when we treat the data at a weekly scale: we do not get such a nuanced view of fluctuations in volume.

Figure 2.2: Line chart showing weekly trend
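For anyone curious how the choice of interval changes the numbers behind these plots, here is a minimal Python sketch. It is not the code used to produce the figures above, and the timestamps are invented purely for illustration. It bins the same hypothetical 'Created Time' values by day and by week:

```python
from collections import Counter
from datetime import datetime, timedelta

# Hypothetical 'Created Time' values, recorded to the nearest second.
created_times = [
    datetime(2023, 5, 1, 9, 15, 2),    # a Monday
    datetime(2023, 5, 1, 17, 40, 51),
    datetime(2023, 5, 3, 12, 0, 7),
    datetime(2023, 5, 7, 23, 59, 59),  # Sunday of the same week
    datetime(2023, 5, 8, 8, 30, 0),    # Monday of the following week
]

def bin_by_day(timestamps):
    """Count posts per calendar day."""
    return Counter(ts.date() for ts in timestamps)

def bin_by_week(timestamps):
    """Count posts per week, labelling each week by its Monday."""
    return Counter((ts - timedelta(days=ts.weekday())).date() for ts in timestamps)

daily = bin_by_day(created_times)
weekly = bin_by_week(created_times)

for day, n in sorted(daily.items()):
    print("day ", day, n)
for week, n in sorted(weekly.items()):
    print("week", week, n)
```

The same five posts become four daily points but only two weekly points, which is why the weekly line looks smoother and hides day-to-day fluctuations.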

We can also plot the same data using a bar chart rather than a line chart. The interpretation is the same; however, a bar chart may not be as effective in showing trends because the bars are discrete rather than continuous, and it emphasises the magnitude of each individual interval rather than the overall trend.

As such, bar charts are more useful when the absolute value of a data point is relevant, or perhaps when we already have line charts included on the plot and we want to show an additional value (e.g. if we already have proportion plotted over time, we might use bars to represent volume).

Figure 2.3: Bar chart showing weekly trend

2.2 Comparing categories

Often we need to see how different groups or categories compare to each other in terms of a specific metric or variable.

For example we might want to compare the volume of posts per certain topic. The key thing here is that these variables we are measuring are distinct groups and they may not have a logical order (in the case of Platform, for example). In such cases, a bar chart is often an appropriate and popular way to go.

2.2.1 Example

Figure 2.4: Bar chart showing the number of messages per category

2.2.1.1 Interpretation

Similar to the trend over time chart above, the interpretation of a classic bar chart is very simple. However, to help storytelling, there are key changes we can make to improve the clarity of such a plot.

One of these is reordering the categories on the x axis. This could be alphabetical order if we have previously used that order in a project, but it is often more useful to display the bars in order of the y axis value, in this case “Number of posts”.

Figure 2.5: Bar chart showing the number of messages per category

A key issue with bar plots is readability when there are lots of categories. If these categories lead to bars of similar lengths, it can lead to a phenomenon known as the Moiré effect, which causes confusion and difficulty focusing. This can make it challenging to distinguish between individual bars and interpret the data accurately.

Figure 2.6: Bar chart showing the number of messages per category

In the bar chart above, we see an example of the Moiré effect. This is when we start to see “patterns” that don’t exist due to your senses being overloaded by what your eyes are taking in. In addition, charts can look chunky and busy with more than ~ 8 bars - even if they are of varying sizes.

2.2.1.1.1 Lollipop

To combat this, lollipop charts can be used to show the same data:

Figure 2.7: Lollipop chart showing the number of messages per category

Hopefully the lollipop chart is a bit clearer for your eyes. These charts are also good alternatives for bar charts in general if one feels that they are using too many bar charts in a deck, as they provide a much nicer use of white space than some busy bar charts.

A key consideration when choosing between lollipop charts and bar charts is how important it is for the client to know the exact value of each category. In bar charts, it is clear where the top of the bar is, whereas for a lollipop chart it is less obvious: is the value at the centre of the circle (a bit imprecise), at the top of the circle, or at the bottom? Whilst this can be combated by adding the raw values to the lollipop chart directly, it is important to consider who will be viewing the chart and the key message we want to hit home (i.e. general trends or specific values).

2.2.1.2 Pie Chart

Whilst pie charts are commonly used and there is comfort to be had in the familiar, they are rarely the best option for our data. Despite this, we often get requests to show this same data (comparing categories) with pie charts.

Pie charts are most effective when values are around 25%, 50%, or 75%, as we can interpret these percentages much more easily in a pie chart than in a stacked bar.

Figure 2.8: test test

Pie charts are not a good choice if we really want the client to compare the size of segments.

Figure 2.9: test test

This can be alleviated by adding values directly on to the plot, but we will get to that later in the cookbook.

2.2.1.3 Doughnut Chart

The sibling of the pie chart, a doughnut chart displays the same information as a pie chart, though the centre is removed. This allows information to be reported within the centre of the chart itself. However, whereas pie charts are designed so that the proportion of each slice represents the value of interest, for doughnut charts it is the length of each arc that represents the value we are presenting. The same precautions mentioned above for pie charts should be kept in mind when making and presenting a doughnut chart (in fact, if anything, the number of categories that can appropriately be displayed is lower: up to 5).

Figure 2.10: Comparison between a Pie Chart and a Doughnut Chart showing the same data

2.3 Comparing sub-categories within categories

We may need a stacked bar chart when we want to go one step further than a simple bar chart. That is, when we also want to see the relative composition of each bar based on the levels of a second categorical variable.

A key business question that this can answer is how sentiment varies between topics or brands (or, again, any categorical variable).

2.3.1 Example

Figure 2.11: Stacked bar chart showing sentiment volume per category

2.3.2 Interpretation

Interpretation for stacked bar charts requires careful consideration. It may not be clear straight away to a client whether each stack starts from the same baseline as purple in our case (i.e. at 0), or whether it starts from the top of the stack below (which it does in our case). It can also be difficult to directly compare stacks between categories. For the baseline stack (purple), it is clear which category contains the largest number of negative posts; however, it is much more difficult to compare the neutral and positive stacks as they do not share the same baseline (ask yourself, are there more neutral posts in category C or D? It’s not clear, is it?). Finally, stacked bar charts can also be very difficult to interpret when total bar size varies greatly, as the individual stacks can become squashed.

Figure 2.12: Stacked bar chart showing sentiment percentage per category

Therefore we can scale our data so that rather than showing the raw volumes, we can show the proportion/percentage of posts of each sentiment per category. This means it is much easier to see the sentiment make up of categories with lower volume, but also runs the risk of clients not appreciating that each bar in reality represents vastly different volumes.

The way around this is to explain clearly what the chart is showing, whether it is volume or proportion, and to decide on the story you are telling and focus on that (such as comparing between categories or just within them).
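To make the trade-off between volume and proportion concrete, here is a small Python sketch using invented sentiment counts (the category names and numbers are hypothetical, not client data). It converts raw volumes per category into percentages while keeping the totals to hand:

```python
# Hypothetical sentiment volumes per category (illustration only).
volumes = {
    "A": {"negative": 120, "neutral": 300, "positive": 180},
    "B": {"negative": 5, "neutral": 10, "positive": 5},
}

def to_percentages(counts):
    """Scale one category's counts so that they sum to 100."""
    total = sum(counts.values())
    return {sentiment: 100 * n / total for sentiment, n in counts.items()}

percentages = {category: to_percentages(c) for category, c in volumes.items()}
totals = {category: sum(c.values()) for category, c in volumes.items()}

for category in volumes:
    print(category, percentages[category], "total posts:", totals[category])
```

Categories A and B end up with bars of identical height on a percentage chart, even though A contains 600 posts and B only 20. Reporting the totals alongside the chart is one way to stop that difference being lost.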

2.4 Visualise distribution of a continuous variable

Sometimes we will want to understand the shape of the distribution of a continuous variable in our data, for example if we have calculated valence using the ParseR package. Whilst this kind of plot is often kept away from clients and instead used for exploratory data analysis (EDA), it is still an extremely powerful and useful string to have in your bow.

2.4.1 Example

Figure 2.13: Histogram showing the (normal) distribution of sentiment

2.4.2 Interpretation

Histograms contain a series of bars, where each bar represents a range of values for the variable we are visualising (in this case Sentiment Score on the x axis). When interpreting a histogram, it is important to pay attention to the shape, centre, and spread of the distribution. The shape can be normal (symmetric), skewed to the left or right, or bimodal (two peaks). The centre can be represented by the mean or median, while the spread can be represented by the standard deviation or range. Histograms can be useful for identifying outliers, understanding the range and variability of the data, and comparing the distribution of a variable across different groups. Similar to our “trend over time” plots, when creating a histogram it is important to choose an appropriate bin width or range of values for each bar, as this can affect the interpretation of the distribution.
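As a rough illustration of why bin width matters, the Python sketch below counts the same invented scores using two different bin widths. The scores and the `histogram` helper are assumptions made up for this example, not part of ParseR:

```python
import math
from collections import Counter

# Invented sentiment scores between 0 and 1 (illustration only).
scores = [0.05, 0.12, 0.18, 0.22, 0.31, 0.35, 0.38, 0.52, 0.58, 0.91]

def histogram(values, bin_width):
    """Count values per bin; each key is the lower edge of a bin."""
    counts = Counter(math.floor(v / bin_width) for v in values)
    return {round(i * bin_width, 10): n for i, n in sorted(counts.items())}

narrow = histogram(scores, 0.1)
wide = histogram(scores, 0.5)

print("bin width 0.1:", narrow)
print("bin width 0.5:", wide)
```

With a width of 0.1 we can see a cluster of scores between 0.3 and 0.4 and an isolated score above 0.9. With a width of 0.5 the same data collapses into two bars, and both of those details disappear.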

For clarity, here are some histograms showing a left-skewed, right-skewed, and bimodal distribution:

Figure 2.14: Histograms revealing different underlying distributions within data

From these, our interpretation should be that the left-skewed distribution appears as a curve that is skewed to the left, with most values clustered towards the upper end of our scale (closer to 1). The right-skewed distribution appears as a curve that is skewed to the right, with most values clustered towards the lower end (closer to 0). The bimodal distribution appears as two distinct peaks, with most values clustered around two different means (0.3 and 0.7 in this example).

2.5 Visualise most frequent terms

The aim of a bigram network is to get a high level understanding of a specific conversation by finding highly frequent bigrams.

N.B. The term bigram refers to a sequence of two adjacent terms. If we examine three adjacent terms, this would be referred to as a trigram, and so on. In other words, a bigram is an n-gram where n = 2. The plot we present is known as a bigram network, and it displays bigrams. Whilst it may seem like splitting hairs, it is important to be precise in our terminology.
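To make the terminology concrete, here is a minimal Python sketch of extracting and counting n-grams from a toy sentence. The `ngrams` helper and the example text are invented for illustration, and this is not how ParseR builds its networks (a real workflow would typically also clean the text first):

```python
from collections import Counter

def ngrams(tokens, n):
    """Return every run of n adjacent tokens, in order."""
    return list(zip(*(tokens[i:] for i in range(n))))

tokens = "join us every day join us today".split()

bigrams = ngrams(tokens, 2)   # n = 2
trigrams = ngrams(tokens, 3)  # n = 3
bigram_counts = Counter(bigrams)

print(bigram_counts.most_common(1))
print(trigrams[0])
```

Here the bigram ("join", "us") appears twice and every other bigram once. Pairs and counts like these are what a bigram network draws as nodes and edges.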

2.5.1 Example

Figure 2.15: Bigram network showcasing frequently appearing bigrams within a dataset

2.5.2 Interpretation

The interpretation of a bigram network is much trickier than the simple plots we have so far introduced. Each term is represented by both a label and a node (circle), with edges (arrows) between terms representing the direction a bigram should be read. The colour of the nodes and edges represents term and bigram frequency, respectively. The size of the nodes also represents term frequency, and can be a good, easy starting point for identifying highly frequent terms. The physical location of the nodes does not mean anything; they are placed by an algorithm based on what they are connected to. For example, the bigram “join us” is no more similar to “every day” than it is to “annual hispanic”.

For a more detailed overview of bigrams and more business-specific interpretation, please see the ParseR vignette.

2.6 Differences in language between categories

Here we want to compare language use between groups or categories. These categories could be audiences, posts of differing sentiment, posts from different quarters, etc.

2.6.1 Example

2.6.2 Interpretation

Before getting bogged down by the statistical interpretation of WLO, let’s think about this from a data visualisation point of view. WLO charts are simply scatter plots, where the x axis shows word frequency, and the y axis shows the log odds ratio. Each point on the scatter plot is a word, and we apply a label to each point to identify words. Things get a little trickier when we take a look at the x axis and realise it is on a logarithmic scale, meaning the distance between 1 and 10 on this scale is the same as between 10 and 100 (note, though, that for clarity the WLO x axes do not start at 0).
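A quick way to convince yourself of how a logarithmic axis behaves is the short Python sketch below. The function name `log_axis_position` is made up for this example; it simply treats a point's position along the axis as the base-10 logarithm of its value:

```python
import math

def log_axis_position(value):
    """Position of a value along a base-10 logarithmic axis."""
    return math.log10(value)

gap_1_to_10 = log_axis_position(10) - log_axis_position(1)
gap_10_to_100 = log_axis_position(100) - log_axis_position(10)

print(gap_1_to_10, gap_10_to_100)  # both gaps have the same length

# A value of 0 has no position on a log scale, so the axis cannot start there.
try:
    log_axis_position(0)
except ValueError as err:
    print("cannot place 0 on a log axis:", err)
```

Each step along the axis therefore represents a multiplication by ten rather than an addition of a fixed amount.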

For a more detailed overview of WLO and important interpretation from a statistical perspective, please see the ParseR vignette.

In other words, it’s important to remember that the magnitude of a WLO value reflects the strength of the association, but it is not directly interpretable as a probability or frequency. Rather, it reflects the logarithmic difference between two probabilities (or odds), and should be treated as a relative measure of association.

Therefore, when reporting WLO to clients, one must refrain from using phrases such as “This term is X times as likely to appear in Category A than Category B and C”, and instead use phrases such as “This term has a stronger association with Category A than Category B and C”.
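The sketch below illustrates why "X times as likely" is the wrong phrase. It uses a plain, unweighted log odds ratio, which is an assumption made for simplicity and is not the WLO calculation performed by ParseR. The probabilities are invented and deliberately large so the arithmetic is easy to follow:

```python
import math

def log_odds_ratio(p_a, p_b):
    """Plain (unweighted) log odds ratio between two probabilities."""
    odds_a = p_a / (1 - p_a)
    odds_b = p_b / (1 - p_b)
    return math.log(odds_a / odds_b)

# Hypothetical probabilities of a term appearing in Category A and Category B.
p_a, p_b = 0.5, 0.25

probability_ratio = p_a / p_b    # how many "times as likely" the term is
value = log_odds_ratio(p_a, p_b)

print(probability_ratio)
print(round(value, 3))
```

In this toy case the term is twice as likely to appear in Category A, yet the log odds ratio is about 1.1. The value tells us the direction and relative strength of the association, but it cannot be read off as "1.1 times as likely".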

2.7 Comparing two categorical variables with a metric of interest

Heatmaps are extremely useful charts that use colours to enable us to observe patterns in the value of a metric for one or two categorical variables. They are extremely versatile and can be used for a variety of purposes, but all rely on examining the intensity of colours in different areas of the heatmap to answer a specific business question.

For example, they could be used to see in which topics certain brands appear more frequently, or whether there are specific time periods where discussions are more intense.

2.7.1 Example

Figure 2.16: Example heatmap showing branded conversation within different topics

2.7.2 Interpretation

The above heatmap represents the distribution of brand mentions across different topics. That is, if we were to add the values of each cell together, each row would equal 100%, but each column wouldn’t.

The x-axis represents different brands, whereas the y-axis represents different topics. Each cell is coloured and represents the percentage of conversation within each topic that includes mention of a specific brand.
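The row-wise percentages described above can be sketched in a few lines of Python. The topics, brands, and mention counts are invented for illustration and do not correspond to the heatmap shown:

```python
# Hypothetical brand mention counts per topic (rows are topics, columns are brands).
mentions = {
    "Topic 1": {"Brand A": 60, "Brand B": 30, "Brand C": 10},
    "Topic 2": {"Brand A": 10, "Brand B": 10, "Brand C": 30},
}

def row_percentages(table):
    """Convert each row's counts into percentages of that row's total."""
    result = {}
    for row, cells in table.items():
        total = sum(cells.values())
        result[row] = {column: 100 * n / total for column, n in cells.items()}
    return result

heatmap_values = row_percentages(mentions)

row_sums = {row: sum(cells.values()) for row, cells in heatmap_values.items()}
brand_a_column_sum = sum(cells["Brand A"] for cells in heatmap_values.values())

print(heatmap_values)
print(row_sums)            # every row adds up to 100
print(brand_a_column_sum)  # a column has no reason to add up to 100
```

Because each topic is scaled separately, colours can be compared along a row with confidence, whereas comparisons down a column say nothing about the raw volume in each topic.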

Heatmaps are useful for representing a generalised view of the data, rather than an overly precise representation. As such, during interpretation general patterns should be observed rather than specific values referenced. For example, we can see that Brand A is very popular (darker colours), and is highly mentioned in the conversations for all the Topics except Topic 4. Indeed, we can see that Brand E is very prevalent in Topic 4, and dominates the branded conversation here (darker colour). Conversely, we can see that Brand D (and to a lesser extent Brand F) appears in very low percentages in all of the topics (hence the lighter colours in column Brand D).

Another example to help with the interpretation of heatmaps is seen below. This heatmap effectively acts like a calendar, with each row being a different day and each column being an hour of the day. We can treat this plot as showing the proportion of branded conversation occurring at different times during the week. Straight away there are two clear patterns we can see here: for each day, the majority of users posting about this brand do so between the hours of 16:00 and 21:00 (darker colours, reading rows from left to right); and across days, we can see Saturday and Sunday having more posts at most hours than weekdays (darker colours when reading columns up and down).

Figure 2.17: Example heatmap showing hourly social media brand mentions for each day of the week

3 Chart Aesthetic Tips

The following are good data viz practices that should be kept in mind whenever we make a chart. Clients may request charts that go against these practices and principles, but it is important to be aware of such principles in our quest for making beautiful looking Capture Intelligence plots.

3.1 Adding values to plot directly

Sometimes we might want a really clean-looking plot, or full transparency of the exact value being visualised may be paramount.

In this case, also including the raw values directly onto a figure can be extremely useful. Here we take the same bar chart we introduced earlier in the Cookbook but include the specific values that each of our bars represents as a label. Because we include these values, we can also remove our y axis as well as any plot gridlines as they no longer help us discern more information from the plot.

Figure 3.1: Bar chart showing the number of messages per category

As mentioned in the pie chart section (REF), adding labels is highly recommended for pie charts and doughnut charts:

Figure 3.2: Comparison between a Pie Chart and a Doughnut Chart showing the same data

3.2 Adding labels to the plot directly

Similarly, it is often good practice to add a category label directly to the plot too, to avoid having a legend. This is because, in general, legends take too long to read (one’s eyes have to go back and forth between legend and plot), they don’t work well with many colours, and they decrease the accessibility of our plots.

For example, let’s see a plot where we have a legend:

Figure 3.3: Line chart showing daily trend of three categories

Notice how you have to keep looking between the plot and the legend to fully understand which line represents which category?

When we plot the category label directly on the figure, this mental load is removed:

Figure 3.4: Line chart showing daily trend of three categories

3.3 Gridlines and borders

We should aim to remove clutter from plots and strive for clean and elegant plots. I highly recommend anyone interested in data visualisations to read the works of Edward Tufte, but one of his most famous opinions is to “Maximise the data-ink ratio, within reason”. Through this, Tufte proposes a minimalistic approach to data visualisation by removing most parts of a plot which do not display the data itself. This is certainly one extreme of data visualisation aesthetics, and I think a middle balance between including too much and too little ‘non-data ink’ (anything on a chart that doesn’t display the actual data) is the best approach.

With gridlines the advice should be:

  • Gridlines that run perpendicular to the variable of interest are the most useful
  • We do not need gridlines that go from the x axis on a bar chart where the x axis is a category (the bars themselves guide our eye the same way a gridline would).
  • When the data values are specifically displayed on the chart (as in Figure 3.1) then we do not need gridlines to help interpret these values.
  • Gridlines should not be the same colour as the axis lines or font. A light grey keeps them useful without overpowering the plot.

The advice surrounding borders:

  • If the plot you are making is faceted (i.e. made up of lots of little plots as in the case of WLO), then a border should be included around each plot to clearly show which data and information is contained in each individual plot.
  • If the plot is standalone but requires axes, then only the axes of interest (normally the x and y axes) should be drawn, with no outer border.
  • Plots that do not contain any axes (e.g. a bigram) should not have a border around them.

Despite this, consistency is the name of the game here. If for whatever reason a client asks for a border to be around a plot (or a border removed), all similar plots in the deck should follow the same aesthetic.

3.4 Colours

The choice of colours is one of the most important aspects of any good visualisation, with incorrect usage turning a slick visualisation into a hot uninterpretable mess of sadness.

The Capture Intelligence colour palette is based on the viridis palette- a series of colour maps designed to improve graphic readability for those with common forms of colour blindness and/or colour vision deficiency. Plus these colours are super pretty.

Despite this, we are often tasked with using colour palettes that match the client we are performing the work for. This section will not explain how to create your own specific colour palette, nor will it go into detailed colour theory, but rather aims to empower you to make appropriate decisions on the best colours to choose for different visualisations.

Broadly, there are three different types of colour palettes one can use to display different types of data:

  • Qualitative
  • Diverging
  • Sequential

3.4.1 Qualitative colour palettes

These palettes are best used to represent values of distinct categories that do not have an intrinsic order. As such, they are appropriate for line charts, bar charts, pie charts, doughnut charts.

These colours are different hues (i.e. different colours), and are sometimes called unordered colour scales. In these scales, no colour is worth more or less than any other colour.

The default Microsoft colour palette we use is an example of a qualitative (or discrete) colour palette:

Figure 3.5: Example qualitative colours

3.4.2 Sequential colour palettes

These palettes use multiple shade variations- effectively going from a light shade to a dark shade. They are suitable for representing numbers that go from low to high. This means that a reader can see a value represented by a “light colour” and inherently understand that this represents a lower value than a “darker colour”, without even having to look at a legend yet.

Whilst you can use only one colour (e.g. light purple to dark purple), using multiple colours (blue to dark purple) increases the colour contrast and makes it easier to distinguish between values.

Figure 3.6: Example sequential colours

3.4.3 Diverging colour palettes

These palettes are best used when we want to represent a scale around a central value (i.e. a meaningful middle value such as zero, an average, a threshold, a target etc). Whereas sequential colour palettes go from low to high, diverging palettes utilise a neutral colour in the middle of the scale, with two opposite colours with varying shades diverging from this central value. These palettes are often used to visualise negative and positive values or Likert scales. There are two big advantages to using diverging scales: they emphasise the extremes, and they let readers see more differences in the data.

An example of this could be scores of valence that range from 1 (positive) to -1 (negative) with a central value of 0 (neutral).
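To illustrate the idea of two colours diverging from a neutral midpoint, here is a minimal Python sketch that maps a valence score between -1 and 1 to a colour. The three colours are hypothetical choices made for this example and are not the Capture Intelligence palette:

```python
# Hypothetical RGB colours for illustration (NOT the Capture Intelligence palette).
LOW = (94, 60, 153)    # colour for the most negative valence (-1)
MID = (247, 247, 247)  # neutral colour for the central value (0)
HIGH = (230, 97, 1)    # colour for the most positive valence (1)

def _blend(start, end, t):
    """Linearly interpolate between two RGB tuples (t runs from 0 to 1)."""
    return tuple(round(s + (e - s) * t) for s, e in zip(start, end))

def diverging_colour(valence):
    """Map a valence score in [-1, 1] to a hex colour on a diverging scale."""
    if not -1 <= valence <= 1:
        raise ValueError("valence must be between -1 and 1")
    if valence < 0:
        rgb = _blend(LOW, MID, valence + 1)  # -1 gives LOW, 0 gives MID
    else:
        rgb = _blend(MID, HIGH, valence)     # 0 gives MID, 1 gives HIGH
    return "#{:02x}{:02x}{:02x}".format(*rgb)

for score in (-1, -0.5, 0, 0.5, 1):
    print(score, diverging_colour(score))
```

Scores near zero come out close to the neutral colour, while the two extremes receive the strongest and most distinct colours. That is the behaviour we want when both ends of the scale matter.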

Figure 3.7: Example divergent colours

As you can see, the difference between sequential and divergent palettes is subtle (especially with many data values), and deciding between the two should be a considered choice. If you want to emphasise the highest values, use a sequential scale; if you want to emphasise both the lowest and highest values, use a diverging scale.

To show why these different palettes are important, let’s see them in some example plots:

Figure 3.8: When to use qualitative colour palette

Here we see how a simple bar chart can look vastly different when using different palettes. Qualitative colour palettes maximise the distinction between categories, making it easy to differentiate groups at a glance. Similarly, the use of a qualitative palette ensures the colours do not imply any inherent ordering or hierarchy. The sequential colour palette unintentionally conveys a perceived hierarchy or sequence that doesn’t exist, and subtle differences in colour shades can make it difficult for viewers to distinguish between different categories. Applying a divergent palette to unrelated categories can create a false sense of order or relationship between them. The stark contrast in colours may also draw attention away from the actual values being compared, leading to misinterpretation or confusion.

Despite this, sometimes using a sequential palette can be okay for such plots when we want to emphasise an underlying order. Remember when we said that we could rearrange a chart so the categorical values follow the order of the variable of interest? In this case, using a sequential colour palette actually helps to double-encode the value of “number of posts” by both position and colour.

Figure 3.9: When to use qualitative colour palette

Here we can see that, to be honest, the ordered bar chart with the sequential palette is easier to read than the colourful and overwhelming qualitative palette.

Below is another example, this time visualising a heatmap. We can see the value we are visualising with colour (percentage) has a clear order. Using our qualitative palette, which has no inherent order, produces a lego-like mess where it is not clear what each colour represents. The divergent palette is slightly better, but still creates the impression there is a significant distinction between 10% and 40%, when in fact they form a continuous range. The best approach here is the sequential palette, which provides a smooth progression of colours to represent the increasing percentages. This ensures a more accurate representation of the data and helps users perceive the gradual change in values without introducing unnecessary confusion or bias.

Figure 3.10: When to use sequential colour palette

Despite being “text” rather than “numbers”, something like a Likert scale has inherent order to it and hence a qualitative colour palette is unsuitable for this particular visualisation. Whilst this palette enables easy differentiation between groups at a glance, it fails to represent the underlying scale or intensity of the Likert scale. The sequential colour palette is also unsuitable. Its subtle differences in colour shades can make it challenging for viewers to distinguish between the different categories accurately. The sequential palette may inadvertently suggest a progression or intensity within the Likert scale, misleading the interpretation of the data. In contrast, the divergent colour palette can be considered suitable for the Likert scale stacked bar charts. By using a divergent palette, meaningful thresholds or midpoints within the scale can be highlighted effectively. It enables the representation of both positive and negative values, accentuating the contrast between categories.

Figure 3.11: When to use divergent colour palette

*Highlighting key pieces